Which Words Matter in Defining Phrase Reorderings in Statistical Machine Translation?
نویسنده
چکیده
Lexicalized and hierarchical reordering models use relative frequencies of fully lexicalized phrase pairs to learn phrase reordering distributions. This results in unreliable estimation for infrequent phrase pairs which also tend to be longer phrases. There are some smoothing techniques used to smooth the distributions in these models. But these techniques are unable to address the similarities between phrase pairs and their reordering distributions. We propose two models to use shorter sub-phrase pairs of an original phrase pair to smooth the phrase reordering distributions. In the first model we follow the classic idea of backing off to shorter histories commonly used in language model smoothing. In the second model, we use syntactic dependencies to identify the most relevant words in a phrase to back off to. We show how these models can be easily applied to existing lexicalized and hierarchical reordering models. Our models achieve improvements of up to 0.40 BLEU points in Chinese-English translation compared to a baseline which uses a regular lexicalized reordering model and a hierarchical reordering model. The results show that not all the words inside a phrase pair are equally important in defining phrase reordering behavior and shortening towards important words will decrease the sparsity problem for long phrase pairs.
منابع مشابه
Learning Word Reorderings for Hierarchical Phrase-based Statistical Machine Translation
Statistical models for reordering source words have been used to enhance the hierarchical phrase-based statistical machine translation system. Existing word reordering models learn the reordering for any two source words in a sentence or only for two continuous words. This paper proposes a series of separate sub-models to learn reorderings for word pairs with different distances. Our experiment...
متن کاملA joint translation model with integrated reordering
This dissertation aims at combining the benefits and to remedy the flaws of the two popular frameworks in statistical machine translation, namely Phrasebased MT and N-gram-based MT. Phrase-based MT advanced the state-of-the art towards translating phrases3 than words. By memorizing phrases, phrasal MT, is able to learn local reorderings, and handling of other local dependencies such as insertio...
متن کاملHandling phrase reorderings for machine translation
We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to capture phrase reorderings using a structure learning framework. On both the reordering classification and a Chinese-to-English translation task, we show improved performance over a baseline SMT system.
متن کاملLearning Phrase Boundaries for Hierarchical Phrase-based Translation
Hierarchical phrase-based models provide a powerful mechanism to capture non-local phrase reorderings for statistical machine translation (SMT). However, many phrase reorderings are arbitrary because the models are weak on determining phrase boundaries for patternmatching. This paper presents a novel approach to learn phrase boundaries directly from word-aligned corpus without using any syntact...
متن کاملModified Distortion Matrices for Phrase-Based Statistical Machine Translation
This paper presents a novel method to suggest long word reorderings to a phrase-based SMT decoder. We address language pairs where long reordering concentrates on few patterns, and use fuzzy chunk-based rules to predict likely reorderings for these phenomena. Then we use reordered n-gram LMs to rank the resulting permutations and select the n-best for translation. Finally we encode these reorde...
متن کامل